The Digits of CLIP

(March 9, 2021)
OpenAI CLIP
OpenAI Microscope
contrastive learning
Numbers and Digits
https://creativecommons.org/licenses/by/4.0/
Sifting through a single layer of OpenAI’s CLIP, examining numerical and textual neurons

Top row: as visualized in OpenAI Microscope. Bottom row: images produced via SIREN+CLIP, as pioneered by Ryan Murdock.

What is CLIP?

If you are new to CLIP, please first read the excellent introductions to OpenAI’s very successful contrastive learning models from 2021: the blog announcement, the analysis of its multimodal neurons, and the official paper.

This Post

This post is not a systematic study, but rather a simple table of contents compiled while browsing OpenAI’s Microscope pages. Concretely, I spent a few hours staring at all 2,560 entries in Layer 4/4/Add_6 of the mid-sized RN50-x4 model. My labels are just personal impressions, so take them with a large grain of salt.

My main interest was understanding how CLIP sees numbers and, where applicable, mathematical syntax, since that is my own neck of the woods. My list includes various other curious categories I ran into, but it is by no means exhaustive. Overall, ≈270 neurons are listed here, or ≈10% of that single layer in the mid-sized CLIP model. A lot more is available for the curious to explore. Thanks to OpenAI for making all of this possible!
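The two figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the base width of 80 for RN50-x4 and the "8x last stage, 4x bottleneck expansion" layout are my assumptions about the open-source CLIP ResNet definition, not something stated on the Microscope pages, and the function and constant names are made up for illustration.

```python
# Assumption: the CLIP ResNet's last stage is 8x the base width, and its
# bottleneck blocks expand the channel count by a further factor of 4.
LAST_STAGE_MULTIPLIER = 8
BOTTLENECK_EXPANSION = 4

# Assumption: base width of the RN50-x4 vision tower.
RN50X4_BASE_WIDTH = 80


def final_stage_channels(base_width: int) -> int:
    """Number of channels (units) in the output of the last ResNet stage."""
    return base_width * LAST_STAGE_MULTIPLIER * BOTTLENECK_EXPANSION


total_units = final_stage_channels(RN50X4_BASE_WIDTH)

# Approximate number of neurons listed in this post.
listed_units = 270

# Share of the layer covered by the post's lists.
coverage = listed_units / total_units

print(total_units)
print(f"{coverage:.1%}")
```

Under those assumptions the layer size comes out at the 2,560 entries mentioned above, and the listed neurons cover a little over a tenth of them.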

Previews: hover over any link to fetch a preview image directly from the explorer. This can be a bit slow at times, and apologies in advance if you are reading this with the previews completely broken: they are not part of the post and will disappear the moment OpenAI modifies its URL scheme. More robustly, clicking a link opens the official Microscope page in a new tab.

Numbers, numbers, numbers

Numbers in context

Brief Discussion

It appears that CLIP has a variety of neurons concerned with digit detection. I seem to have collected at least one neuron for each individual digit, as well as at least one neuron that responds to the general category of “a digit” and exhibits all ten.

There is also a set of detectors for handwritten digits, e.g. on chalkboards, notebooks and parchment. Several neurons appear to spot numbers of a particular length (e.g. between 4 and 6 digits), or of a particular placement and aspect ratio, such as centered in the image or spanning the entire image width. Another neuron looks for vertically ordered digits.

Units, quantities and measurements are detected separately, but are curiously grouped by kind of unit, such as weight or length. Clock faces, both digital and analog, are also detected separately.

A couple of neurons even reach for basic algebraic syntax. This capacity does not go far, and it appears to be shared with visually similar syntax that has a completely different meaning: for example, subtraction is matched together with compound identifiers such as ISBNs, phone numbers and simple chemical formulas.
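As a loose, text-level analogy for why these cases might look alike to a detector keyed on surface form, here is a minimal sketch. The regular expression is a hypothetical stand-in that I made up for illustration; it says nothing about how the CLIP neuron actually works, and it does not cover chemical formulas.

```python
import re

# Hypothetical surface pattern: groups of digits joined by hyphens,
# with optional whitespace around each hyphen.
DIGITS_WITH_HYPHENS = re.compile(r"\d+(?:\s*-\s*\d+)+")

# Strings with very different meanings but a similar visual shape,
# plus one control that should not match.
samples = {
    "subtraction": "12 - 5",
    "ISBN": "978-3-16-148410-0",
    "phone number": "555-867-5309",
    "plain word": "hello",
}

# True where the whole string fits the "digits and hyphens" shape.
matches = {
    label: bool(DIGITS_WITH_HYPHENS.fullmatch(text))
    for label, text in samples.items()
}

for label, matched in matches.items():
    print(f"{label}: {matched}")
```

A pattern this shallow cannot tell an arithmetic expression from an identifier, which is roughly the kind of conflation the Microscope examples suggest.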

Some numbers seem to appear as secondary markers of contextual items, in neurons that detect jerseys, calendar dates and rulers. Lastly, it is fascinating to see some neurons shared with seemingly unrelated categories (goatees, puppies, tails, small flames, belly buttons), in what may be rarer, specialized cases.

Possibly broader STEM

Characters / Text

Sampling everything else